HTML Parsing
HTML parsing is the process by which the browser reads raw HTML text (for example, the contents of an .html file) and converts it into a structured, in-memory representation called the DOM (Document Object Model).
1. HTML Source Code Arrives
When you open a webpage, the browser downloads the HTML file from the server.
<!DOCTYPE html>
<html>
<head>
<title>My Page</title>
</head>
<body>
<h1>Hello</h1>
<p>Welcome to HTML parsing!</p>
</body>
</html>
At this point the document is only plain text. It has not yet been structured or rendered.
2. Tokenization
The browser’s HTML parser reads this text character by character and breaks it into tokens, each representing an HTML construct:
- Start tags (<html>, <body>), together with any attributes they carry
- End tags (</body>)
- Text (Hello, Welcome...)
- Comments and the DOCTYPE declaration
For the page above, the token stream begins like this, one token per line:
<!DOCTYPE html>
<html>
<head>
<title>
My Page
</title>
</head>
<body>
<h1>
Hello
</h1>
...
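The tokenization step can be approximated outside the browser. The sketch below uses Python's standard-library `html.parser.HTMLParser`, which is not a browser's tokenizer but reports the same kinds of constructs as it reads the text. The names `TokenCollector`, `tokenize`, and `HTML_SOURCE` are made up for this illustration, and whitespace-only text between tags is skipped to keep the output short.

```python
from html.parser import HTMLParser

# The example page from step 1.
HTML_SOURCE = """<!DOCTYPE html>
<html>
<head>
<title>My Page</title>
</head>
<body>
<h1>Hello</h1>
<p>Welcome to HTML parsing!</p>
</body>
</html>"""


class TokenCollector(HTMLParser):
    """Records each construct the parser recognizes as a (kind, value) pair."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text inside the <! ... >.
        self.tokens.append(("doctype", decl))

    def handle_starttag(self, tag, attrs):
        # Attributes arrive with the start tag, not as separate tokens.
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text such as the newlines between tags.
        if data.strip():
            self.tokens.append(("text", data.strip()))

    def handle_comment(self, data):
        self.tokens.append(("comment", data))


def tokenize(source):
    """Feeds the source text to the parser and returns the collected tokens."""
    collector = TokenCollector()
    collector.feed(source)
    collector.close()
    return collector.tokens


if __name__ == "__main__":
    for kind, value in tokenize(HTML_SOURCE):
        print(kind, repr(value))
```

Running the script prints one line per token in document order, starting with the doctype and the `html` start tag, which mirrors the listing above.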
3. DOM Tree Construction
As the tokens are recognized, the browser creates nodes and connects them hierarchically to build the DOM tree.
Document
└── html
    ├── head
    │   └── title: "My Page"
    └── body
        ├── h1: "Hello"
        └── p: "Welcome to HTML parsing!"
- The DOM is a tree structure representing all elements and their relationships.
- JavaScript reads and modifies this tree through the DOM API, and CSS selectors are matched against its elements.
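Tree construction can be sketched in the same way: keep a pointer to the innermost open element, attach each new node to it, and move back up when an end tag arrives. The example below is a simplified illustration built on Python's `html.parser`, not how a browser implements it. `Node`, `TreeBuilder`, `build_tree`, and `render` are hypothetical names, and the code assumes well-formed input with no void elements (such as `<br>`) and none of the error recovery a real browser performs.

```python
from html.parser import HTMLParser

# The example page from step 1.
HTML_SOURCE = """<!DOCTYPE html>
<html>
<head>
<title>My Page</title>
</head>
<body>
<h1>Hello</h1>
<p>Welcome to HTML parsing!</p>
</body>
</html>"""


class Node:
    """A simplified DOM node: a name, optional text, and child nodes."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree by tracking the currently open element."""

    def __init__(self):
        super().__init__()
        self.root = Node("Document")
        self.current = self.root  # innermost element that is still open

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        self.current = node  # later tokens belong inside this element

    def handle_endtag(self, tag):
        # Assumption: input is well formed, so the end tag closes `current`.
        if self.current.name == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.current.text += data.strip()


def build_tree(source):
    """Parses the source text and returns the root Document node."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


def render(node, depth=0):
    """Returns the tree as a list of indented lines, one node per line."""
    label = node.name + (f': "{node.text}"' if node.text else "")
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(build_tree(HTML_SOURCE))))
```

The printed outline has the same shape as the tree shown above: `Document` at the top, `html` beneath it, then `head` and `body` with their children.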